19 research outputs found

    Relevance similarity: an alternative means to monitor information retrieval systems

    Get PDF
    BACKGROUND: Relevance assessment is a major problem in the evaluation of information retrieval systems. The work presented here introduces a new parameter, "Relevance Similarity", for the measurement of the variation of relevance assessment. In a situation where individual assessment can be compared with a gold standard, this parameter is used to study the effect of such variation on the performance of a medical information retrieval system. In such a setting, Relevance Similarity is the ratio of assessors who rank a given document the same as the gold standard to the total number of assessors in the group. METHODS: The study was carried out on a collection of Critically Appraised Topics (CATs). Twelve volunteers were divided into two groups according to their domain knowledge. They assessed the relevance of retrieved topics obtained by querying a meta-search engine with ten keywords related to medical science. Their assessments were compared to the gold standard assessment, and Relevance Similarities were calculated as the ratio of positive concordance with the gold standard for each topic. RESULTS: The similarity comparison among groups showed that a higher degree of agreement exists among evaluators with more subject knowledge. The performance of the retrieval system was not significantly different as a result of the variations in relevance assessment in this particular query set. CONCLUSION: In assessment situations where evaluators can be compared to a gold standard, Relevance Similarity provides an alternative evaluation technique to the commonly used kappa scores, which may give paradoxically low scores in highly biased situations such as document repositories containing large quantities of relevant data.
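
    The abstract defines Relevance Similarity as a simple ratio, so a small sketch can make the computation concrete. The function below is illustrative only and assumes binary relevance judgments; the variable names are not taken from the paper.

        def relevance_similarity(assessor_judgments, gold_judgment):
            """Fraction of assessors whose judgment of a document matches the gold standard."""
            matches = sum(1 for judgment in assessor_judgments if judgment == gold_judgment)
            return matches / len(assessor_judgments)

        # Example: twelve assessors judging one retrieved topic (1 = relevant, 0 = not relevant).
        judgments = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
        print(relevance_similarity(judgments, gold_judgment=1))  # 0.75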

    PubMed related articles: a probabilistic topic-based model for content similarity

    Get PDF
    BACKGROUND: We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH® in MEDLINE®. RESULTS: The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision. CONCLUSION: Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.
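
    As a rough illustration of the Poisson "aboutness" idea behind pmra, the sketch below contrasts an elite and a background Poisson rate for a term and combines shared terms into a relatedness score. The rates, the prior, and the final combination are placeholder assumptions, not the published PubMed parameters or weighting.

        import math

        def about_probability(k, lam=0.02, mu=0.01, eta=0.5):
            """P(document is 'about' a term's topic | the term occurs k times).
            Term counts are modeled as Poisson with rate lam when the document is
            'about' the topic (elite) and mu otherwise; eta is the elite prior.
            The k! factor cancels between the two hypotheses, so it is omitted."""
            elite = eta * math.exp(-lam) * lam ** k
            background = (1 - eta) * math.exp(-mu) * mu ** k
            return elite / (elite + background)

        def relatedness(counts_a, counts_b, idf):
            """Toy relatedness: idf-weighted 'aboutness' over terms shared by two documents."""
            shared = counts_a.keys() & counts_b.keys()
            return sum(idf.get(t, 1.0)
                       * about_probability(counts_a[t])
                       * about_probability(counts_b[t])
                       for t in shared)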

    PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval

    Get PDF
    Graph analysis algorithms such as PageRank and HITS have been successful in Web environments because they are able to extract important inter-document relationships from manually created hyperlinks. We consider the application of these algorithms to related document networks composed of automatically generated content-similarity links. Specifically, this work tackles the problem of document retrieval in the biomedical domain, in the context of the PubMed search engine. A series of reranking experiments demonstrates that incorporating evidence extracted from link structure yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to different environments.
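
    A compact way to picture the reranking setup is PageRank run over a graph whose edges are the automatically generated content-similarity links, with the resulting scores folded back into the original retrieval scores. The sketch below assumes a dense adjacency matrix and a simple linear interpolation; neither detail is taken from the paper.

        import numpy as np

        def pagerank(adjacency, damping=0.85, iterations=50):
            """Power-iteration PageRank over a document graph; adjacency[i, j] = 1
            means document i has a 'related article' link to document j."""
            A = np.asarray(adjacency, dtype=float)
            n = A.shape[0]
            out_degree = A.sum(axis=1, keepdims=True)
            out_degree[out_degree == 0] = 1.0      # avoid division by zero for sink nodes
            transition = A / out_degree            # row-stochastic transition matrix
            rank = np.full(n, 1.0 / n)
            for _ in range(iterations):
                rank = (1 - damping) / n + damping * (transition.T @ rank)
            return rank

        def rerank(retrieval_scores, graph_scores, alpha=0.7):
            """Illustrative linear combination of the original retrieval score
            with the link-analysis score."""
            return alpha * np.asarray(retrieval_scores) + (1 - alpha) * np.asarray(graph_scores)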

    Is searching full text more effective than searching abstracts?

    Get PDF
    BACKGROUND: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE® abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine. RESULTS: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles. CONCLUSION: Users searching full text are more likely to find relevant articles than users searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
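
    The suggestion of combining span-level and article-level evidence can be sketched as a simple score interpolation. The interpolation below, and the example scores, are assumptions for illustration; the paper does not prescribe a particular combination.

        def article_score(span_scores, full_article_score, weight=0.5):
            """Toy evidence combination: interpolate the best paragraph-span score
            for an article with the score of the article indexed as a whole."""
            best_span = max(span_scores) if span_scores else 0.0
            return weight * best_span + (1 - weight) * full_article_score

        # Hypothetical retrieval scores (e.g. bm25) for three articles and their spans.
        article_scores = {"a1": 4.2, "a2": 5.1, "a3": 3.0}
        span_scores = {"a1": [6.3, 2.1], "a2": [3.8], "a3": [5.5, 4.0]}

        ranking = sorted(article_scores,
                         key=lambda a: article_score(span_scores[a], article_scores[a]),
                         reverse=True)
        print(ranking)  # ['a1', 'a2', 'a3']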

    Searching for musical features using natural language queries: the C@merata evaluations at MediaEval

    Get PDF
    Musicological texts about classical music frequently include detailed technical discussions concerning the works being analysed. These references can be specific (e.g. C sharp in the treble clef) or general (fugal passage, Thor’s Hammer). Experts can usually identify the features in question in music scores, but a means of performing this task automatically could be very useful for experts and beginners alike. Following work on textual question answering over many years as co-organisers of the QA tasks at the Cross Language Evaluation Forum, we decided in 2013 to propose a new type of task where the input would be a natural language phrase, together with a music score in MusicXML, and the required output would be one or more matching passages in the score. We report here on 3 years of the C@merata task at MediaEval. We describe the design of the task, the evaluation methods we devised for it, the approaches adopted by participant systems and the results obtained. Finally, we assess the progress which has been made in aligning natural language text with music and map out the main steps for the future. The novel aspects of this work are: (1) the task itself, linking musical references to actual music scores, (2) the evaluation methods we devised, based on modified versions of precision and recall, applied to demarcated musical passages, and (3) the progress which has been made in analysing and interpreting detailed technical references to music within texts.
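
    The passage-level precision and recall used in the evaluations can be sketched in simplified form. The version below treats a passage as a (start, end) position pair in the score and distinguishes exact from overlapping matches, which only loosely mirrors the task's beat- and measure-level measures; it is an illustration, not the official scorer.

        def passages_match(returned, gold, exact=True):
            """A returned passage (start, end) is correct if it exactly matches a gold
            passage, or merely overlaps it when exact=False."""
            if exact:
                return returned == gold
            return returned[0] <= gold[1] and gold[0] <= returned[1]

        def passage_precision_recall(returned, gold, exact=True):
            hits = [r for r in returned if any(passages_match(r, g, exact) for g in gold)]
            found = [g for g in gold if any(passages_match(r, g, exact) for r in returned)]
            precision = len(hits) / len(returned) if returned else 0.0
            recall = len(found) / len(gold) if gold else 0.0
            return precision, recall

        # Example: two passages returned for a query, one gold passage in the score.
        print(passage_precision_recall([(4, 8), (12, 16)], [(4, 8)]))  # (0.5, 1.0)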

    What is the Role of NLP in Text Retrieval?

    No full text
    This paper addresses the value of linguistically-motivated indexing (LMI) for document and text retrieval. After reviewing the basic concepts involved and the assumptions on which LMI is based, namely that complex index descriptions and terms are necessary, I consider past and recent research on LMI, and specifically on automated LMI via NLP. Experiments in the first phase of research, up to the late eighties, did not demonstrate value in LMI, but they were very limited; the much larger tests of the nineties, with full text, have not demonstrated it either. My conclusion is that LMI is not needed for effective retrieval, but it has other important roles within information-selection systems. The rapid growth of full-text databases, together with developments in natural language processing (NLP) technology, has prompted those engaged with NLP to suggest that it could be usefully applied to text retrieval, primarily for indexing purposes but perhaps also for more or less related tasks such as document ‘abstracting’ or extracting; it could be applied at shallow text as well as at deep content levels, and for user display or for database creation. Retrieval itself has various modes, including filtering or routing as well as one-off searching.

    A stochastic context free grammar based framework for analysis of protein sequences

    Get PDF
    BACKGROUND: In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is no higher than stochastic regular grammars. These grammars, like other state-of-the-art methods, cannot capture higher-order dependencies such as nested and crossing relationships that are common in proteins. In order to overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences in which grammars are induced using a genetic algorithm. RESULTS: This framework was implemented in a system aimed at the production of binding site descriptors. These descriptors not only allow detection of the protein regions that are involved in these sites, but also provide insight into their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described either by PROSITE patterns, domain profiles or a set of patterns. Results show that the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to dealing with patterns shared by non-homologous proteins. CONCLUSION: A new Stochastic Context Free Grammar based framework has been introduced, allowing the production of binding site descriptors for the analysis of protein sequences. Experiments have shown that not only is this new approach valid, but it produces human-readable descriptors for binding sites that have been beyond the capability of current machine learning techniques.
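
    To make the grammar formalism concrete, the sketch below encodes a toy stochastic context-free grammar over a reduced two-letter amino-acid alphabet and computes the inside (total generation) probability of a sequence with a CYK-style dynamic program. The grammar, its probabilities and the alphabet reduction are invented for illustration; the paper's grammars are induced by a genetic algorithm from quantitative amino-acid properties.

        from collections import defaultdict

        # Toy SCFG in Chomsky normal form over a reduced alphabet:
        # 'h' = hydrophobic residue, 'p' = polar residue. Probabilities are illustrative.
        binary_rules = {"S": [("B", "S", 0.4), ("B", "B", 0.6)],   # A -> B C, probability
                        "B": [("B", "B", 0.2)]}
        terminal_rules = {"B": [("h", 0.5), ("p", 0.3)]}           # A -> terminal, probability

        def inside_probability(sequence, start="S"):
            """Probability that the grammar generates the sequence (CYK-style inside algorithm)."""
            n = len(sequence)
            chart = defaultdict(float)             # (i, j, nonterminal) -> inside probability
            for i, symbol in enumerate(sequence):
                for lhs, rules in terminal_rules.items():
                    for terminal, p in rules:
                        if terminal == symbol:
                            chart[(i, i + 1, lhs)] += p
            for span in range(2, n + 1):
                for i in range(n - span + 1):
                    j = i + span
                    for lhs, rules in binary_rules.items():
                        for left, right, p in rules:
                            for k in range(i + 1, j):
                                chart[(i, j, lhs)] += p * chart[(i, k, left)] * chart[(k, j, right)]
            return chart[(0, n, start)]

        print(inside_probability("hph"))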

    An Innovative Approach to Data Management and Curation of Experimental Data Generated Through IR Test Collections

    No full text
    This paper describes the steps that led to the invention, design and development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system for managing and accessing the data used and produced within experimental evaluation in Information Retrieval (IR). We present the context in which DIRECT was conceived, its conceptual model, and its extension to make the data available on the Web as Linked Open Data (LOD), enabling and enhancing their enrichment, discoverability and re-use. Finally, we discuss possible further evolutions of the system.
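
    As a minimal illustration of what publishing evaluation data as Linked Open Data can look like, the snippet below builds a few RDF triples for a hypothetical experiment run with rdflib and serialises them as Turtle. The vocabulary and URIs are invented for the example and are not DIRECT's actual conceptual model.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import RDF, XSD

        EX = Namespace("http://example.org/ir-eval/")   # hypothetical vocabulary

        g = Graph()
        run = URIRef("http://example.org/ir-eval/run/42")
        g.add((run, RDF.type, EX.ExperimentRun))
        g.add((run, EX.usedCollection, EX.trec2005genomics))
        g.add((run, EX.meanAveragePrecision, Literal(0.31, datatype=XSD.decimal)))

        print(g.serialize(format="turtle"))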